In this project, we will try to build a model to find the optimal neighbourhood for opening a new business. As an example, we will specify the business type to be an Italian restaurant.
Since London is a huge city with a lot of restaurants, we will try to identify the optimal neighbourhood to open a new Italian restaurant, based on:
Based on the description of the business problem, we will need to get data of:
We will extract the required data as follows:
First, let's import the required libraries.
import numpy as np # library to handle data in a vectorized manner
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import re # library for regular expressions
import geocoder
import folium
from folium import plugins
from folium.plugins import HeatMap
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import geopy.distance
import configparser #Library to read config file
import json
import requests
import math
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
We will retrieve the list of London neighbourhoods from the Wikipedia page List of London boroughs, which contains the London boroughs, their geo-location information, and other irrelevant information that will be excluded.
# Function to convert an area in square miles to square metres
def convert_mi2_m2(row) :
    mi_2 = row["Area (sq mi)"]
    m_2 = mi_2 * 2590000
    return m_2
# Function to calculate the radius (in metres) of a circle with the given area in square metres
def convert_area_radius(row) :
    area = row["Area"]
    radius = math.sqrt(area / math.pi)
    return radius
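As a quick sanity check of the two conversions above, here is a standalone sketch using a hypothetical borough of exactly 1 square mile (the helper names below are local to this example):

```python
import math

SQ_MI_TO_SQ_M = 2_590_000  # 1 square mile is approximately 2.59e6 square metres

def area_to_radius(area_m2):
    # radius of a circle whose area equals the given area
    return math.sqrt(area_m2 / math.pi)

# A 1 sq mi borough maps to a circle of radius roughly 908 metres
area = 1 * SQ_MI_TO_SQ_M
radius = area_to_radius(area)
print(round(radius))
```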
# read html tables from the Wikipedia page
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_London_boroughs')
print('List of London boroughs loaded')
# we get 2 dataframes so we concat them into one dataframe
df = pd.concat([dfs[0], dfs[1]])
# Calculate the area and radius in metric measurements
df["Area"] = df.apply(lambda row : convert_mi2_m2(row), axis=1)
df["Area-Radius"] = df.apply(lambda row : convert_area_radius(row), axis=1)
# extract only the columns we are interested in
london_df = df[['Borough', 'Co-ordinates', 'Area', 'Area-Radius']].reset_index(drop=True)
london_df.rename(columns={'Borough' : 'Neighbourhood'}, inplace = True)
# remove footnote markers such as [note 1] from the names
note_regex = re.compile(r"\[note \d]")
london_df['Neighbourhood'] = london_df['Neighbourhood'].str.replace(pat=note_regex, repl='', regex=True)
#Split the Co-ordinates column into 2 columns for DMS and Decimal geo-formats
london_df[['Co-ordinates_DMS', 'Co-ordinates_DEC']] = london_df['Co-ordinates'].str.split(pat=' / ', expand=True)
london_df[['Latitude', 'Longitude']] = london_df['Co-ordinates_DEC'].str.split(expand=True)
# extract and clean Latitude and Longitude from Co-ordinates_DEC
def clean_dec(dec_str):
    sign = -1 if re.search('[swSW]', dec_str) else 1
    dec_str = re.sub(r'°.', '', dec_str)
    dec_str = re.sub(' ', '', dec_str)
    dec_str = re.sub(u'\ufeff', '', dec_str)
    return sign * float(dec_str)
london_df['Latitude'] = london_df['Latitude'].apply(clean_dec)
london_df['Longitude'] = london_df['Longitude'].apply(clean_dec)
# extract only needed columns
london_df = london_df[['Neighbourhood', 'Latitude', 'Longitude', 'Area', 'Area-Radius']]
london_df.head()
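As a quick check of clean_dec, here is a standalone copy applied to coordinate strings in the decimal format used on the Wikipedia page (the example values are illustrative):

```python
import re

def clean_dec(dec_str):
    # strip the degree sign, direction letter, spaces and BOM characters,
    # then apply a negative sign for south/west coordinates
    sign = -1 if re.search('[swSW]', dec_str) else 1
    dec_str = re.sub(r'°.', '', dec_str)
    dec_str = re.sub(' ', '', dec_str)
    dec_str = re.sub(u'\ufeff', '', dec_str)
    return sign * float(dec_str)

print(clean_dec('51.5607°N'))  # 51.5607
print(clean_dec('0.1557°W'))   # -0.1557
```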
Each neighbourhood in London has a different area. We assume the latitude and longitude obtained from Wikipedia to be the center of each neighbourhood.
Since we are going to provide a radius in our Foursquare API calls, we associated each neighbourhood with the radius of a circle whose area equals the neighbourhood's area.
Now, let's get the location of London using the geopy API.
def get_geo_location(addr) :
    geolocator = Nominatim(user_agent="uk_explorer")
    location = geolocator.geocode(addr)
    latitude = location.latitude
    longitude = location.longitude
    #print('The geographical coordinates of London City are {}, {}.'.format(latitude, longitude))
    return [latitude, longitude]
address = 'London, UK'
london_location = get_geo_location(address)
print('Coordinate of {}: {}'.format(address, london_location))
Now we can visualize the map of London and its neighbourhoods. Each neighbourhood center will be surrounded by a circle whose area matches the neighbourhood's area.
map_london = folium.Map(location=london_location, zoom_start=10)
folium.CircleMarker(
    london_location,
    radius=3,
    color='red',
    popup=address,
    fill=True,
    fill_color='red',
    fill_opacity=0.6
).add_to(map_london)
for lat, lng, radius, label in zip(london_df.Latitude, london_df.Longitude, london_df['Area-Radius'], london_df.Neighbourhood) :
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_london)
    folium.Circle(
        [lat, lng],
        radius=radius,
        color='Yellow'
    ).add_to(map_london)
map_london
The circle coverage of London doesn't seem good enough. Let's increase the radius of all neighbourhoods by 30% and check again.
london_df['Area-Radius'] *= 1.3
map_london = folium.Map(location=london_location, zoom_start=10)
folium.CircleMarker(
    london_location,
    radius=3,
    color='red',
    popup=address,
    fill=True,
    fill_color='red',
    fill_opacity=0.6
).add_to(map_london)
for lat, lng, radius, label in zip(london_df.Latitude, london_df.Longitude, london_df['Area-Radius'], london_df.Neighbourhood) :
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_london)
    folium.Circle(
        [lat, lng],
        radius=radius,
        color='Yellow'
    ).add_to(map_london)
map_london
Now we have better coverage of London. We can use this dataset for our analysis.
Now we need to build a dataset of the restaurants in each neighbourhood. Our focus will be on restaurants in general and especially Italian restaurants. For this purpose we will use the Foursquare API.
First, let's prepare the basic Foursquare connection configurations.
# Read Foursquare authentication info from hidden config file
config = configparser.ConfigParser()
config.read('secrets.cfg')
CLIENT_ID = config['4square_personal']['CLIENT_ID']
CLIENT_SECRET = config['4square_personal']['CLIENT_SECRET']
VERSION = '20201103' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
REQUEST_DEFAULT_PARAMS = dict(
client_id = CLIENT_ID,
client_secret = CLIENT_SECRET,
v=VERSION
)
We will use the explore endpoint to query for restaurants in each neighborhood.
To limit the query results, we will set the section parameter to food, which should restrict the results to food-related venues. We will also use the sortByPopularity parameter to get the results sorted by popularity.
Also, we will use the Venue Details endpoint to get the ratings of the Italian restaurants.
def get_Restaurant_Rating(restaurant_id) :
    URL_4SQU_DETAIL = 'https://api.foursquare.com/v2/venues/{}'.format(restaurant_id)
    detail_params = REQUEST_DEFAULT_PARAMS.copy()
    try :
        rating = requests.get(url=URL_4SQU_DETAIL, params=detail_params).json()['response']['venue']['rating']
    except (requests.RequestException, ValueError, KeyError) :
        # venues without a rating (or failed requests) default to 0.0
        rating = 0.0
    return rating
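Deeply nested JSON access like the rating lookup above is fragile. An alternative sketch uses chained dict.get calls so that only a missing key falls back to 0.0, without relying on exception handling (the dict shapes below are illustrative, mimicking the Foursquare venue response):

```python
def extract_rating(response_json):
    # Walk response -> venue -> rating, defaulting to 0.0 at any missing level
    return (response_json.get('response', {})
                         .get('venue', {})
                         .get('rating', 0.0))

print(extract_rating({'response': {'venue': {'rating': 8.4}}}))  # 8.4
print(extract_rating({'response': {}}))                          # 0.0
```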
# validate whether the venue category is a competitor restaurant and if it is Italian restaurant
def validate_rest_category(rest_cat_name) :
    excluded_categories = ['Café', 'Pub', 'Food Court', 'Fast Food', 'Shop', 'Coffee', 'Bakery', 'Breakfast']
    italian_list = ['Italian Restaurant', 'Abruzzo Restaurant', 'Agriturismo', 'Aosta Restaurant', 'Basilicata Restaurant', 'Calabria Restaurant', 'Campanian Restaurant', 'Emilia Restaurant', 'Friuli Restaurant', 'Ligurian Restaurant', 'Lombard Restaurant', 'Malga', 'Marche Restaurant', 'Piadineria', 'Piedmontese Restaurant', 'Puglia Restaurant', 'Romagna Restaurant', 'Roman Restaurant', 'Sardinian Restaurant', 'Sicilian Restaurant', 'South Tyrolean Restaurant', 'Trattoria/Osteria', 'Trentino Restaurant', 'Tuscan Restaurant', 'Umbrian Restaurant', 'Veneto Restaurant']
    is_restaurant = True
    for excl in excluded_categories :
        if excl in rest_cat_name :
            is_restaurant = False
            break
    is_italian = rest_cat_name in italian_list
    return is_restaurant, is_italian
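A quick check of the category filter above, as a standalone copy with an abbreviated Italian-category list:

```python
def validate_rest_category(rest_cat_name):
    excluded = ['Café', 'Pub', 'Food Court', 'Fast Food', 'Shop',
                'Coffee', 'Bakery', 'Breakfast']
    italian = ['Italian Restaurant', 'Trattoria/Osteria', 'Tuscan Restaurant']  # abbreviated
    # a venue is a competitor unless its category contains an excluded keyword
    is_restaurant = not any(excl in rest_cat_name for excl in excluded)
    is_italian = rest_cat_name in italian
    return is_restaurant, is_italian

print(validate_rest_category('Italian Restaurant'))    # (True, True)
print(validate_rest_category('Fast Food Restaurant'))  # (False, False)
print(validate_rest_category('Chinese Restaurant'))    # (True, False)
```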
def get_neighborhood_restaurants(lat, lng, radius) :
    URL_4SQU_EXPLORE = 'https://api.foursquare.com/v2/venues/explore'
    explore_params = REQUEST_DEFAULT_PARAMS.copy()
    explore_params.update({
        "ll" : '{}, {}'.format(lat, lng),
        "radius" : radius,
        "section" : 'food',
        "sortByPopularity" : 1
    })
    restaurants_list = requests.get(url=URL_4SQU_EXPLORE, params=explore_params).json()['response']['groups'][0]['items']
    return restaurants_list
# collect rows in a list and build the DataFrame once at the end
# (DataFrame.append is deprecated and was removed in pandas 2.0)
restaurant_rows = []
for lat, lng, radius, neighbourhood in zip(london_df.Latitude, london_df.Longitude, london_df['Area-Radius'], london_df.Neighbourhood) :
    rest_lst = get_neighborhood_restaurants(lat, lng, radius)
    for rest_info in rest_lst :
        restaurants_id = rest_info['venue']['id']
        restaurants_name = rest_info['venue']['name']
        rest_cat_id = rest_info['venue']['categories'][0]['id']
        rest_cat_name = rest_info['venue']['categories'][0]['name']
        rest_lat = rest_info['venue']['location']['lat']
        rest_lng = rest_info['venue']['location']['lng']
        is_restaurant, is_italian = validate_rest_category(rest_cat_name)
        rating = 0.0
        if is_restaurant :
            if is_italian :
                rating = get_Restaurant_Rating(restaurants_id)
            restaurant_rows.append({'Neighbourhood' : neighbourhood, 'Restaurant_ID' : restaurants_id, 'Restaurant_Name' : restaurants_name, 'Category_ID' : rest_cat_id, 'Category_Name' : rest_cat_name, 'Restaurant_Latitude' : rest_lat, 'Restaurant_Longitude' : rest_lng, 'Is_Italian' : is_italian, 'Rating' : rating})
restaurants_df = pd.DataFrame(restaurant_rows, columns=['Neighbourhood', 'Restaurant_ID', 'Restaurant_Name', 'Category_ID', 'Category_Name', 'Restaurant_Latitude', 'Restaurant_Longitude', 'Is_Italian', 'Rating'])
restaurants_df.head()
london_df = london_df.merge(restaurants_df, on='Neighbourhood', how='inner')
london_df.head()
Now let's visualize the results on the map of London.
map_london = folium.Map(location=london_location, zoom_start=10)
folium.CircleMarker(
    london_location,
    radius=3,
    color='red',
    popup=address,
    fill=True,
    fill_color='red',
    fill_opacity=0.6
).add_to(map_london)
for lat, lng, label, special in zip(london_df.Restaurant_Latitude, london_df.Restaurant_Longitude, london_df.Restaurant_Name, london_df.Is_Italian) :
    # Italian restaurants are drawn in green, other restaurants in blue
    color = 'green' if special else 'blue'
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        color=color,
        popup=label,
        fill=True,
        fill_color=color,
        fill_opacity=0.6
    ).add_to(map_london)
map_london
This concludes our Data section.
Now we have the list of restaurants and their locations, and we know which restaurants are Italian.
In this project we will try to detect neighbourhoods in London with a low number of restaurants, especially a low number of Italian restaurants. Also, since London contains many restaurants, we will consider the ratings of the Italian restaurants, so we pick neighbourhoods where their average rating is low.
In the first section of the report, we gathered the data that will be used in our report:
In the second section of the report, we will perform some analysis on the data we gathered so we can better understand the distribution of restaurants in London, and, if needed, enrich the data with extra information to support a better decision.
In the third section, we will use K-Means clustering to split London neighbourhoods into similar clusters based on the criteria defined above; accordingly, we can decide which neighbourhood(s) are best for opening a new Italian restaurant.
Let's perform some basic analysis on the data.
london_boroughs_url = 'london_boroughs.json'
with open(london_boroughs_url) as fp :
    london_boroughs_json = json.load(fp)

def boroughs_style(feature):
    return { 'color': 'blue', 'fill': False }
Let's check the total count of restaurants per neighborhood.
london_df['count'] = 1
london_df[['Neighbourhood', 'count']].groupby(['Neighbourhood']).sum().reset_index().sort_values(by=['count'], ascending=False)
all_heat_count = london_df[['Latitude', 'Longitude', 'count']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist()
map_london = folium.Map(location=london_location, zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_london)
folium.Marker(london_location).add_to(map_london)
HeatMap(data=all_heat_count, radius=50).add_to(map_london)
folium.GeoJson(london_boroughs_json, style_function=boroughs_style, name='geojson').add_to(map_london)
for lat, lng, label in zip(london_df.Latitude, london_df.Longitude, london_df.Neighbourhood) :
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_london)
map_london
Let's check the Italian Restaurants in each neighborhood.
italian_df = london_df[london_df.Is_Italian].copy()
italian_df[['Neighbourhood', 'count']].groupby(['Neighbourhood']).sum().reset_index().sort_values(by=['count'], ascending=False)
ita_heat_count = italian_df[['Latitude', 'Longitude', 'count']].groupby(['Latitude', 'Longitude']).sum().reset_index().values.tolist()
map_london = folium.Map(location=london_location, zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_london)
folium.Marker(london_location).add_to(map_london)
HeatMap(data=ita_heat_count, radius=50).add_to(map_london)
folium.GeoJson(london_boroughs_json, style_function=boroughs_style, name='geojson').add_to(map_london)
for lat, lng, label in zip(london_df.Latitude, london_df.Longitude, london_df.Neighbourhood) :
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_london)
map_london
Let's check the average rating of the Italian restaurants in each neighborhood.
italian_df[['Neighbourhood', 'Rating']].groupby(['Neighbourhood']).mean().reset_index().sort_values(by=['Rating'], ascending=False)
ita_heat_rating = italian_df[['Latitude', 'Longitude', 'Rating']].groupby(['Latitude', 'Longitude']).mean().reset_index().values.tolist()
map_london = folium.Map(location=london_location, zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_london)
folium.Marker(london_location).add_to(map_london)
HeatMap(data=ita_heat_rating, radius=50).add_to(map_london)
folium.GeoJson(london_boroughs_json, style_function=boroughs_style, name='geojson').add_to(map_london)
for lat, lng, label in zip(london_df.Latitude, london_df.Longitude, london_df.Neighbourhood) :
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_london)
map_london
As we can see from the analysis and visualizations, the City of London is a hot spot for restaurants. This is typical of most cities.
Based on that, we can calculate the distance to the City of London and include it as an attribute of each neighbourhood.
def calculate_distance(row) :
    # distance (in metres) from the row's coordinates to the City of London
    loc = (row['Latitude'], row['Longitude'])
    london_loc = (london_df[london_df.Neighbourhood == 'City of London']['Latitude'].iloc[0],
                  london_df[london_df.Neighbourhood == 'City of London']['Longitude'].iloc[0])
    return geopy.distance.distance(loc, london_loc).meters
london_df["Distance_To_Center"] = london_df.apply(lambda row : calculate_distance(row), axis=1)
london_df.head()
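geopy computes geodesic distances on an ellipsoidal Earth model. For a rough sanity check of such distances, the simpler haversine formula on a spherical Earth gives values within a fraction of a percent (a standalone sketch; the coordinates below are illustrative):

```python
import math

def haversine_m(p1, p2, earth_radius_m=6_371_000):
    # great-circle distance between two (lat, lng) pairs, in metres
    lat1, lng1, lat2, lng2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km anywhere on the globe
d = haversine_m((51.5, -0.1), (52.5, -0.1))
print(round(d / 1000, 1))
```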
Now we will prepare the dataset that will be used for our modeling.
For each neighborhood, we will aggregate:
rest_count = london_df[['Neighbourhood', 'count']].groupby(['Neighbourhood']).sum().reset_index()
rest_count.rename(columns={'count' : 'Restaurant_Count'}, inplace = True)
italian_rate = italian_df[['Neighbourhood', 'Rating']].groupby(['Neighbourhood']).mean().reset_index()
italian_rate.rename(columns={'Rating' : 'Average_Ratings'}, inplace= True)
italian_count = italian_df[['Neighbourhood', 'count']].groupby(['Neighbourhood']).sum().reset_index()
italian_count.rename(columns={'count' : 'Italian_Count'}, inplace= True)
london_distance = london_df[['Neighbourhood', 'Distance_To_Center']].groupby(['Neighbourhood', 'Distance_To_Center']).size().reset_index(name='freq').drop('freq', axis=1)
dataset_df = pd.merge(rest_count, london_distance, how='left', on='Neighbourhood')
dataset_df = pd.merge(dataset_df, italian_count, how='left', on='Neighbourhood')
dataset_df = pd.merge(dataset_df, italian_rate, how='left', on='Neighbourhood')
dataset_df = dataset_df.fillna(0)
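The fillna(0) above matters because neighbourhoods with no Italian restaurants have no rows in italian_count or italian_rate, so the left merges leave NaN there. A toy illustration with two hypothetical neighbourhoods 'A' and 'B':

```python
import pandas as pd

rest_count = pd.DataFrame({'Neighbourhood': ['A', 'B'], 'Restaurant_Count': [10, 4]})
italian_count = pd.DataFrame({'Neighbourhood': ['A'], 'Italian_Count': [3]})

# 'B' has no Italian restaurants, so its Italian_Count is NaN after the left merge
merged = pd.merge(rest_count, italian_count, how='left', on='Neighbourhood')
merged = merged.fillna(0)
print(merged['Italian_Count'].tolist())  # [3.0, 0.0]
```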
Now we have the final dataset for our K-Means clustering model. Let's normalize our dataset.
X = dataset_df.values[:,1:]
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
#cluster_dataset
num_clusters = 5
k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_
print(labels)
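The choice of num_clusters = 5 above is a judgment call. A common way to pick k is the elbow method: fit K-Means over a range of k and look for where the inertia curve flattens. A minimal sketch on synthetic data of the same shape idea (rows = neighbourhoods, columns = the four features; in practice cluster_dataset would be used):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the real standardized feature matrix
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(32, 4)))

inertias = []
for k in range(1, 9):
    km = KMeans(init='k-means++', n_clusters=k, n_init=12, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" where the drop
# flattens out suggests a reasonable number of clusters
print([round(i, 1) for i in inertias])
```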
Note that each row in our dataset represents a neighborhood, and therefore, each row is assigned a label.
dataset_df['Labels'] = labels
dataset_df.sort_values(by='Labels', ascending=True)
As we can see from the results of our analysis, the London neighbourhoods with label 0 are the most interesting for opening a new Italian restaurant, because they have the fewest Italian restaurants, with low ratings.
Let's check the neighbourhoods in cluster 0 and order the result by distance to the City Center.
dataset_df[dataset_df['Labels'] == 0].sort_values(by='Distance_To_Center')
As we can see, Camden is the nearest neighbourhood to the City Center. However, it has many restaurants, which might be a downside because of competition from restaurants serving other cuisines.
Second on the list is Lewisham, which might be a better option as it has fewer restaurants, despite being a little farther from the City Center than Camden.
The objective of this report is to identify the neighbourhoods in London with the fewest restaurants, especially Italian restaurants, that are close to the city center. This should help the report's stakeholders narrow down the neighbourhoods to search when opening a new Italian restaurant.
In order to achieve this objective, we gathered geo-data of London neighbourhoods and used Foursquare endpoints to gather information about the restaurants in each neighbourhood. We also used map visualizations to help the user with the data exploration stage.
Based on the neighbourhood clustering we did, we provided two recommendations for the end user to choose from.
Of course, there are many factors that might influence the end user's decision (e.g. other businesses in the area, ease of access and transportation, etc.), which are out of the scope of this report.